Genomic insights on carotenoid synthesis by extremely halophilic archaea Haloarcula rubripromontorii BS2, Haloferax lucentense BBK2 and Halogeometricum borinquense E3 isolated from the solar salterns of India

Haloarchaeal cultures were isolated from solar salterns of Goa and Tamil Nadu and designated as BS2, BBK2 and E3. These isolates grew with a characteristic bright orange to pink pigmentation and were capable of growing in media containing upto 25% (w/vol) NaCl. Whole genome sequencing (WGS) of the three haloarchaeal strains BS2, BBK2 and E3 indicated an assembled genomic size of 4.1 Mb, 3.8 Mb and 4 Mb with G + C content of 61.8, 65.6 and 59.8% respectively. Phylogenetic analysis based on the 16S rRNA gene sequence revealed that the archaeal isolates belong to Haloarcula, Haloferax and Halogeometricum genera. Haloarcula rubripromontorii BS2 was predicted to have 4292 genes with 4242 CDS regions, 46 tRNAs, 6 rRNAs and 3 misc_RNAs. In case of Haloferax lucentense BBK2,, 3840 genes with 3780 CDS regions were detected along with 52 tRNAs, 5 rRNAs and 3 misc_RNAs. Halogeometricum borinquense E3 contained 4101 genes, 4043 CDS regions, 52 tRNAs, 4 rRNAs, and 2 misc_RNAs. The functional annotation and curation of the haloarchaeal genome, revealed C50 carotenoid biosynthetic genes like phytoene desaturase/carotenoid 3′ -4′ desaturase (crtI), lycopene elongase (ubiA/lyeJ) and carotenoid biosynthesis membrane protein (cruF) in the three isolates. Whereas crtD (C-3′,4′ desaturase), crtY (lycopene cyclase) and brp/blh (β-carotene dioxygenase) genes were identified only in BS2.


Genome sequencing/ Illumina sequencing
The paired-end sequence library was prepared using Illumina Nextera XT DNA Library Preparation Kit.The whole genome sequencing (WGS) was performed using 151 bp paired-end sequencing protocol on Illumina platform using HiSeq4000 sequencer according to the manufacturer's instructions.Briefly, for the library preparation DNA was enzymatically fragmented by the enzyme mix provide by the Nextra XT DNA kit followed by adapter ligation of the fragments to generate sequencing libraries.Indexing adapters were ligated to the ends of the DNA fragments, preparing them for hybridization onto a flow cell.The ligated products were purified using AMPure XP beads supplied in the kit.The size-selected product was PCR amplified as described in the kit protocol.The amplified library was analysed in HiSeq4000 sequencer using High Sensitivity (HS) DNA chip as per manufacturer's instructions.After obtaining the Qubit concentration for the library, it was loaded onto HiSeq for cluster generation and sequencing.The kit reagents were used in binding of samples to complementary adapter oligos on paired-end flow cell.The adapters were designed to allow selective cleavage of the forward strands after re-synthesis of the reverse strand during sequencing.The copied reverse strand was then used to sequence from the opposite end of the fragment.

Genomic assembly annotation and bioinformatic analysis
Quality assessment of the raw fastq reads of the samples was performed using FastQC v.0.11.9 tool at default parameters 25 .The raw fastq reads were pre-processed using Fastp v.0.20.1 26 .The high-quality processed data was de novo assembled using SKESA v.2.4.0 27 .Assembly completeness and the rate of contamination was assessed using CheckM 28 .The assemblies were assessed by Quast v. 5.2.0 tool 29 .Further the generated assemblies were annotated for gene prediction using Prokka v. 1.14.6 30 .The protein sequence file generate by Prokka was used for the upcoming bioinformatic analysis and the second round of annotation was done by RAST v. 2.0 31,32 .Numerical distribution of the annotated genes was represented into different subsystems obtained in both the pie chart and list format.Protein sequence (.faa) generated by Prokka was subjected to PANNZER2 33 and BlastKOALA v.3.0 34 for functional characterization and annotation of unknown proteins.Circular genome map of the draft assembly was generated using PROKSEE 35 .The Prokka annotated proteins were further aligned against the blast databases using DIAMOND blastp v.2.0.6 36 sequence aligner and subjected to Blast2GO v.6.0.3 37 .The Gene ontology (GO) terms were subjected to WEGO v. 2.0 38 online server for enrichment of GO terms.The biosynthetic gene clusters (BGCs) respective to carotenoid biosynthetic gene cluster were predicted by antiSMASH v.7.0. 39.Presence of possible CRISPR sequences and associated Cas protein sequences were determined by CRISPERCasFinder web tool 40 .PlasmidFinder and pMLST was employed for identification of plasmid sequences, if any in the genome sequence 41 .PHASTER web tool 42,43 was used to find any prophage regions in the sequenced genomes.

Sequencing, assembly, phylogenetic analysis, and general features of the genomes
The genomic DNA libraries of the three haloarchaeal isolates were prepared by Nextera XT DNA Library Preparation Kit and the genome sequence was determined using illumine HiSeq sequencer.The Illumina-compatible sequencing library for the sample had 14,954,776 reads with an average fragment size of 853 bp for strain BS2, 14,468,630 reads with an average fragment size of 875 bp for strain BBK2 and 15,163,526 reads with an average fragment size of 788 bp for strain E3 25 and the raw fastq reads werepre-processed using Fastp v.0.20.1 26 .The genome wide de novo assembly was done using SKESA v.2.4.0 27 .The assembled genomic size of Haloarcula rubripromontorii BS2 was 4.1 Mb with N50 value of 508,287 bp, while for Haloferax lucentense BBK2 the genomic size was 3.8 Mb with N50 value of 211,896 bp and Halogeometricum borinquense E3 had a genome size of 4 Mb with N50 value of 240303 bp.The G + C content of the three haloarchaeal strains was 61.8, 65.6 and 59.8% for BS2, BBK2 and E3 respectively.It's not surprising to learn that these haloarchaeal isolates have a higher GC content in the genomic DNA as previous reports on halophilic archaea have indicated the same 51 .
Similar to bacterial cells archaea generally contain extrachromosomal plasmids responsible for various functions.However, the genomic DNA extracted from all the three strains in this study lacked plasmid sequences.It is known that CRISPR arrays, along with their Cas proteins, provide bacteria and archaea with a form of adaptive immunity, defending them against exogenous phages or plasmids.CRISPRCasFinder is a tool that enables the detection of both CRISPR arrays and Cas proteins 40 .In BS2, BBK2 and E3 strains 3, 7 and 6 possible CRISPR repeat sequences were identified.BS2 strain contained 2 possible prophage regions, whereas both BBK2 and

Genomic prediction and genome annotation
Genomic annotation analysis using Prokka tool predicted Haloarcula sp.BS2 has 22 contigs, 4292 genes with 4242 CDS regions, 46 tRNAs, 6 rRNAs and 3 misc_RNAs.In case of Haloferax sp.BBK2, 37 contigs, 3840 genes with 3780 CDS regions were detected along with 52 tRNAs, 5 rRNAs, 3 misc_RNAs and 3 repeat regions.Halogeometricum sp.E3 contained 28 contigs with 4101 genes, 4043 CDS regions, 52 tRNAs, 4 rRNAs, 2 misc_RNAS and 4 repeat regions (Table .1).The protein sequences were further aligned against databases using DIAMOND blastp v.2.0.6 for Blastp alignment and annotation, which provided gene lengths, e-values, GO IDs, enzyme code and names.Annotated protein sequences were subjected to Blast2GO v.6.0.3 37 and then GO terms were subjected to the WEGO v.2.0 38 for enrichment and plotting Gene ontology (GO) annotation data (Supplementary Fig. 3).GO statistics using WEGO represented 3,015, 2,919 and 2,881 total annotated genes for BS2, BBK2 and E3, which are divided into three classes: Biological process (1743, 1750 and 1694 respectively of annotated genes); Molecular function (2318, 2236 and 2186 respectively of annotated genes) and Cellular components (1501, 1442 and 1458 respectively of annotated genes) (Supplementary Table.1).Functional annotation by PANNZER2 tool provided information about the GO annotations and descriptive prediction of annotated genes specific for enzymes involved in pigment production and diversity among the cultures.Annotated data was also represented by performing BlastKOALA resulting in 38.3, 41.1 and 37.2% annotated output for BS2, BBK2 and E3 respectively and it assigned KEGG ontology for the annotated genes according to the functional characterization of individual genes (Fig. 3).RASTtk SEED data output of BS2 revealed 4396 number of coding sequences with 625 (15%) featured in subsystems and 3771 (85%) were not in any subsystems.Similarly in the case of BBK2 number of coding sequences were found to be 4259 with 634 (15%) featured in subsystems and 3625 (85%) were not in any subsystems, whereas in case of E3 number of coding sequences were 4246 with 612 (15%) featured in   subsystems and 3625 (85%) were not in any subsystems (Supplementary Fig. 4).For these cultures the number of coding sequences featured in a subsystem encoded for 612 (97.9%) as non-hypothetical proteins while 13 (2%) as hypothetical proteins for BS2; 625 (98.6%) as non-hypothetical while 9 (1.4%) as hypothetical for BBK2 and 610 (99.7%) as non-hypothetical whereas 11 (1.8%) as hypothetical for E3 strain.Likewise, the number of coding sequences not featured in any subsystem encoded 1021 (27%) as non-hypothetical proteins while 2750 (72.9%) as hypothetical in the case of BS2; whereas for BBK2 930 (25.7%) were non-hypothetical proteins whereas 2694 (74.3%) were hypothetical and lastly for E3 941 (25.9%) were non-hypothetical proteins and 2684 (74%) were hypothetical.antiSMASH results showed (Table .2) that all three strains possess a terpene cluster.

Bacterioruberin biosynthesis
Genomic annotations of all three cultures predicted C50 carotenoid synthesis (through terpenoid biosynthesis) via mevalonic acid (MVA) pathway.The gene geranylgeranyl diphosphate synthase type I (idsA1) responsible for the synthesis of precursor molecule geranylgeranyl pyrophosphate (GGPP) necessary for the carotenoid biosynthesis by the MVA pathway was annotated in all the three isolates BS2, BBK2 and E3.The C50 carotenoid synthesis pathway includes a 15-cis-phytoene synthase (crtB), lycopene elongase (ubiA/lyeJ) and carotenoid biosynthesis membrane protein (cruF) genes which were found in all three isolates thus confirming the C50 carotenoid synthesis.An important gene crtI here is been identified and annotated as phytoene desaturase and carotenoid 3′-4′ desaturase in all the three isolates, whereas crtD annotated as C-3′,4′ desaturase (CrtD), crtY annotated as lycopene beta cyclase (CrtY) and brp/blh annotated as β-carotene dioxygenase (Brp) were identified only in Haloarcula sp.BS2 .It has been reported that Haloarcula japonica seems to have two types desaturation reactions involving CrtI enzyme for the first reaction converting phytoene to lycopene and CrtD in the second reaction forming double bonds at C-3,4 and C-3′,4′ of the lycopene derivatives.Further both these genes have been reported to show homology for CrtD and CrtI enzymes, thus indicating their identical functionalities 7 .
According to PANNZER functional annotation, the crtI annotated as phytoene desaturase/carotenoid 3′, 4′ desaturase and crtD annotated as C-3′,4′ desaturase, thus, it could be hypothesised that Haloarcula sp.BS2 has both the desaturase enzyme (CrtD and CrtI) required in bacterioruberin C50 carotenoid biosynthesis.Whereas, in the other two isolates Haloferax sp.BBK2 Haloferax sp. and Halogeometricum sp.E3 Halogeometricum sp.only crtI gene was identified for the two desaturase enzyme activities required in the pathway.Other than crtD (C-3′,4′ desaturase), crtY and brp/blh genes encoding for lycopene cyclase and β-carotene dioxygenase respectively were also not identified in BBK2 and E3 isolates.Gene crtY is involved in β-carotene biosynthesis from lycopene and followed by retinal and bacteriorhodopsin biosynthesis.Therefore, in this study, we report that crtD, crtY and brp/bhl genes were found to be specific only for Haloarcula sp.BS2 (Table 3).Available studies report that only a few species of Haloarcula 7 and Halobacterium are able to synthesise bacteriorhodopsin, which has also shown to inhibit bacterioruberin synthesis 52 .Haloferax and Halogeometricum species are not known to synthesize bacteriorhodopsin and therefore accumulate only bacterioruberin.Thus, our data indicates that there is a diversity within the haloarchaeal species with regards to bacterioruberin synthesis and could be one of the reasons for the variation in shades of red, orange and pink colonies among these species 53 .
The genes thus involved in carotenogenesis in Haloarcularubripromontorii BS2, Haloferax lucentense BBK2 and Halogeometricum borinquense E3, have been depicted through a biosynthetic pathway in Fig. 4. The heat map representing these annotated genes is shown in Supplementary Fig. 5.The antiSMASH output predicted the three strains BS2, BBK2 and E3 to have two biosynthetic gene clusters (BGCs) responsible for the secondary metabolites of terpene.ClusterBlast identified that two BGCs showed similarities in gene cluster of Haloarcula vallismortis DSM 3756 T for BS2; Haloferax volcanii DS2 T for BBK2 and Halogeometricum borinquense DSM 11551 T for E3.According to KnownClusterBlast, these three isolates share no similarity with any known pathways.The phytoene synthase gene (crtB) was represented by the core biosynthetic gene of terpene BGCs in all the three strains.Nucleotide locations of both the BGCs associated with carotenoid biosynthesis of each of the strains were identified and represented in the Table .2. www.nature.com/scientificreports/Haloarchaeal isolates express brilliant red -orange pigment attributed to bacterioruberin.These unique C50 pigments provide membrane stability, UV protection, ROS scavenging, etc., to the halophilic archaea [13][14][15][16] .Haloarchaea inhabit extreme hypersaline econiches and are known to produce various potential novel bioactive compounds.C50 carotenoids structurally exhibit better antioxidative properties due to their extended conjugated double bonds and the inclusion of at least one hydroxyl group 17,18 .Consequently, this uncommon class of carotenoids is appealing for various biological and industrial applications.Our sequencing data revealed the genes related to C50 carotenoid synthesis and diversity within the biosynthetic pathway among the haloarchaeal cultures.In-depth genomic analysis of the C50 carotenoid pathway in haloarchaeal species is not much elucidated.This study thus focused on giving insights into the genomic characterization of different species of halophilic archaea and the hypothetical pathway involved in the C50 carotenoid biosynthesis in them.This will open new avenues by promoting whole genome guided bioprospecting approach to carotenogenesis and other biotechnological applications of haloarchaeal species.

Fig. 1 .
Fig. 1.Circular Genome Plot of Haloarcula rubripromontorii BS2 (a); Haloferax lucentense BBK2 (b) and Halogeometricum borinquense E3 (c).A graphical circular map of the genome was performed with Proksee tool.The outermost ring 1 (Forward strand) and ring 2 (Reverse strand), show the protein coding regions.The 3rd Ring in Grey, shows the contigs.Ring 4 shows the GC skew and ring 5 shows the GC content.

Table 2 .
Distribution of BGCs respective to terpene and carotenoid biosynthesis gene clusters of strains BS2; BBK2 and E3 detected by antiSMASH 2.0 server.

Table 3 .
Functionally annotated genes of terpenoid backbone and carotenoid biosynthesis pathways for strains.BS2; BBK2 and E3 with KEGG number, involved in the different reported terpenoid backbone and carotenoid biosynthesis pathways.